Erasmus is a European Union program that allows students to study at a foreign university through an exchange. Every student considers many factors when choosing a destination. In this final project we used a dataset provided by the European Union to explore whether there is any relationship between student characteristics such as age, gender and nationality and their choice of Erasmus destination. We also provide several plots to help understand the dataset better. Finally, we created a machine learning model that predicts the destination for a student.
The dataset that we used covers the 2012-2013 academic year and can be found here. It is published directly by the European Union. It was created from the statistical reports of the national agencies of the 33 countries participating in the Erasmus+ program (Erasmus decentralised actions) and from data provided by the Education, Audiovisual and Culture Executive Agency (Erasmus centralised actions). The data is generated during the application process of each student and then collected by the respective universities. It contains 267547 observations and 34 variables.
The host institution country is one of the most interesting variables to us, and it has a lot of undefined values, around 55 thousand, so we need to filter those out. For both host and home country, values are stored as country codes. However, Belgium is coded as three different values, “BEDE”, “BEFR” and “BENL”, depending on the language area (German, French or Dutch). We merge all of these values into a single one for the whole of Belgium.
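The cleaning steps described above could look roughly like the following sketch. The toy data frame `erasmus`, the column name `HOME_COUNTRY` and the marker `"???"` for undefined host countries are assumptions made for illustration; the real dataset may name and encode these differently.

```r
# Toy stand-in for the real dataset (values are made up for illustration).
erasmus <- data.frame(
  HOME_COUNTRY                 = c("BENL", "ES", "BEFR", "DE", "BEDE"),  # hypothetical column name
  HOST_INSTITUTION_COUNTRY_CDE = c("ES", "???", "BEDE", "BENL", "UK"),
  stringsAsFactors = FALSE
)

# Assumption: undefined host countries are marked with "???" (hypothetical marker).
undefined_marker <- "???"
filtered <- erasmus[erasmus$HOST_INSTITUTION_COUNTRY_CDE != undefined_marker, ]

# Collapse the three Belgian language-area codes into a single "BE" code.
# Works on character vectors; factor columns would need as.character() first.
merge_belgium <- function(codes) {
  codes[codes %in% c("BEDE", "BEFR", "BENL")] <- "BE"
  codes
}

filtered$HOST_INSTITUTION_COUNTRY_CDE <- merge_belgium(filtered$HOST_INSTITUTION_COUNTRY_CDE)
filtered$HOME_COUNTRY <- merge_belgium(filtered$HOME_COUNTRY)

print(filtered)
```

Keeping the merge in a small helper function means the same rule is applied to both the home and the host column.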
There are 34 different variables and we are not going to use all of them, so we list the ones that are most relevant for our research:
The first thing we wanted to explore is whether there is a difference between the number of males and females enrolled in Erasmus. We expected to see a significant difference, as one of the cited papers suggests that there is a gender gap. The pie chart presented here confirms this assumption.
Next we wanted to see which countries send the most students on Erasmus. Rather than just listing them, we decided to present this metric on a map of Europe, coloring each country according to the number of students whose home university is in that country. We can see that Spain, France and Germany lead in students enrolled in Erasmus. Surprisingly, Turkey also ranks very high.
## [1] ES DE FR IT PL TR UK NL CZ PT FI RO AT HU
## [15] BENL GR LT SE DK SK BEFR LV CH BG EE NO HR LU
## [29] CY LI MT BEDE IE SI IS
## 35 Levels: AT BEDE BEFR BENL BG CH CY CZ DE DK EE ES FI FR GR HR HU ... UK
We were also interested in the areas of study in which Erasmus is most popular. The dataset contains a code for each area, and we used the International Standard Classification of Education to map those codes to area names. We also merged areas whose codes start with the same two digits, since those are related, and finally displayed the statistic as a bar plot.
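The merging of related subject areas by their first two digits can be sketched as follows. The codes in `area_codes` are made-up examples, not values taken from the dataset.

```r
# Hypothetical subject area codes, stored as text so that prefixes are easy to take.
area_codes <- c("314", "311", "481", "222", "34", "340")

# Keep only the first two characters so that related areas fall into one group.
area_group <- substr(area_codes, 1, 2)

# Count students per merged area, largest groups first.
counts <- sort(table(area_group), decreasing = TRUE)
print(counts)

# A bar plot of the merged areas could then be drawn with:
# barplot(counts, las = 2)
```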
To explore the data further, we wanted to see the age distribution. At that point we noticed that one student attended Erasmus at the age of 93. There were some other unusual records, such as students aged 73 and 69. Despite that, we present the student distribution up to the age of 30, where most of the students are. Among males the most frequent age was 22, and among females 21. On this plot we can also see that there are more female students in pretty much every age category.
The last thing we wanted to explore is the 10 most popular universities in Europe among students. This is a simple bar plot that shows the universities and the number of ERASMUS students enrolled in them. Sweden is leading with universities in Stockholm and Linköping, while third place belongs to a university in Valencia.
We want to have the host country as our outcome variable and see how the other variables relate to it. There are 34 different variables, but not all of them make sense to include in the model. After exploring the dataset we decided that we need just a few of them. Here is the formula of our model:
HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE
We also explain why we included each variable:
The first thing that comes to mind when talking about the strength of relationships is a linear model, fitted with the function lm(). However, our outcome is categorical rather than continuous, so we cannot use this function. Our next option is logistic regression, which handles a categorical outcome. The only problem is that logistic regression usually assumes a binary outcome, whereas we have multiple classes. More precisely, since the host country is our dependent variable, we have as many categories as there are countries in that column. For the 2012-2013 dataset there are 33 countries, and that is how many classes we have. This is where a multinomial model, which supports as many classes as we want, comes in handy. We use the multinom() function from the package nnet and specify the data, the formula, the maximum number of weights and the maximum number of iterations. We created the model in R with the following command:
library(nnet)  # provides multinom()
model <- multinom(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, MaxNWts = 3000, maxit = 20)
We adjusted the model so that it allows at most 3000 weights and runs for at most 20 iterations.
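To illustrate how multinom() handles an outcome with more than two classes, here is a minimal sketch on the built-in `iris` data rather than the Erasmus data. The choice of predictors and `maxit = 100` are assumptions for this toy example only; `trace = FALSE` just silences the iteration log.

```r
library(nnet)  # provides multinom()

# Three-class outcome, analogous in structure (not in size) to the host country.
toy_model <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris,
                      MaxNWts = 3000, maxit = 100, trace = FALSE)

# One row of coefficients per non-reference class.
print(coef(toy_model))

# Predicted class for each observation, compared with the true class.
pred <- predict(toy_model, newdata = iris)
print(table(predicted = pred, actual = iris$Species))
```

Because each non-reference class gets its own row of coefficients, the number of parameters grows with both the number of classes and the number of predictor levels, which is why a large MaxNWts can be needed for an outcome with many countries.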
Even though we managed to create this model, calculating its summary did not finish in a reasonable time, so we had to take another approach. For this reason alone we reduced our dataset to only two outcome categories, UK and ES, and created a model with only those two classes. Now we can apply a logistic regression model, since the outcome is binary. The model is created by the following command:
model <- glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE + STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE, data = filtered, family = binomial())
This calculation is much faster, so we can explore the strength of the relationships properly.
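The reduction to two outcome classes can be sketched on simulated data as follows. The data frame `toy` and its values are randomly generated for illustration and carry no information about the real dataset.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the filtered dataset (random values, illustration only).
toy <- data.frame(
  HOST_INSTITUTION_COUNTRY_CDE = factor(sample(c("ES", "UK", "DE"), 300, replace = TRUE)),
  STUDENT_AGE_VALUE            = sample(19:30, 300, replace = TRUE),
  STUDENT_GENDER_CDE           = factor(sample(c("F", "M"), 300, replace = TRUE))
)

# Keep only the two host countries of interest and drop the unused factor level.
two_class <- droplevels(subset(toy, HOST_INSTITUTION_COUNTRY_CDE %in% c("ES", "UK")))

# With a two-level factor outcome, glm() treats the first level as "failure"
# and models the probability of the second level.
toy_glm <- glm(HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_AGE_VALUE + STUDENT_GENDER_CDE,
               data = two_class, family = binomial())

print(levels(two_class$HOST_INSTITUTION_COUNTRY_CDE))
print(coef(summary(toy_glm)))
```

Calling droplevels() matters here: without it the outcome factor would still carry the unused third level.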
We decided to use model trees because linear regression did not give satisfactory results (R squared was 0.02).
Model trees are grown in much the same way as regression trees, but at each leaf, a multiple linear regression model is built from the examples reaching that node.
Two of the attributes used are categorical: STUDENT_NATIONALITY_CDE and STUDENT_SUBJECT_AREA_VALUE. The machine learning algorithm that we are going to use requires the attributes to be nominal. Some machine learning algorithms implemented in R create the dummy variables themselves, but not in our case, so we needed to one-hot encode these two attributes.
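One way to one-hot encode the two categorical attributes in base R is sketched below. This is an illustration using model.matrix() on a made-up data frame `df`, not necessarily the encoding routine used in the project.

```r
# Small made-up example with the two categorical attributes and one numeric one.
df <- data.frame(
  STUDENT_NATIONALITY_CDE    = factor(c("ES", "DE", "FR", "ES")),
  STUDENT_SUBJECT_AREA_VALUE = factor(c("31", "48", "31", "22")),
  STUDENT_AGE_VALUE          = c(21, 22, 23, 21)
)

# contrasts = FALSE asks for a full indicator column per level (true one-hot),
# instead of the default coding that drops one reference level per factor.
one_hot <- model.matrix(
  ~ STUDENT_NATIONALITY_CDE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_AGE_VALUE - 1,
  data = df,
  contrasts.arg = list(
    STUDENT_NATIONALITY_CDE    = contrasts(df$STUDENT_NATIONALITY_CDE, contrasts = FALSE),
    STUDENT_SUBJECT_AREA_VALUE = contrasts(df$STUDENT_SUBJECT_AREA_VALUE, contrasts = FALSE)
  )
)

print(colnames(one_hot))
print(one_hot)
```

Each row then has exactly one 1 among the nationality columns and exactly one 1 among the subject area columns, while the numeric age column is passed through unchanged.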
A model tree improves on a regression tree by replacing the leaf nodes with regression models. This often gives more accurate results than regression trees, which use only a single value for prediction at each leaf node. In R, the regression tree algorithm is provided by the rpart package.
We used the M5’ algorithm (M5-prime) by Wang and Witten, which is an enhancement of the original M5 model tree algorithm proposed by Quinlan in 1992.
We decided to train the model on 80% of the dataset and use the remaining 20% as test data. The model is then used to predict the host institution country for the test data. As a measure of performance we use the mean absolute error, which is the average absolute difference between our predictions and the actual values. In addition we use the accuracy of our predictions.
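library(RWeka)  # provides M5P()
# Draw a random 80% sample of row indices for training
dt <- sort(sample(nrow(df_reg), floor(nrow(df_reg) * 0.8)))
train <- df_reg[dt, ]
test <- df_reg[-dt, ]
# Train the M5' model tree on all remaining attributes
m.m5p <- M5P(HOST_INSTITUTION_COUNTRY_CDE ~ ., data = train)
# Predict the host country code for the held-out test set
p.m5p <- predict(m.m5p, test)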
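The two performance measures mentioned above can be computed as in the sketch below. The vectors `actual` and `predicted` are made-up numbers, and rounding the numeric prediction to the nearest country code before comparing is an assumption about how accuracy might be derived from a regression-style output.

```r
# Made-up numeric country codes and model outputs, for illustration only.
actual    <- c(3, 10, 7, 22, 15)
predicted <- c(3.4, 12.1, 6.8, 14.0, 15.2)

# Mean absolute error: average absolute difference between prediction and truth.
mae <- mean(abs(predicted - actual))

# Accuracy (assumption): a prediction counts as correct if it rounds to the true code.
accuracy <- mean(round(predicted) == actual)

print(mae)
print(accuracy)
```

Note that the mean absolute error depends on the arbitrary numeric ordering of the country codes, so it is harder to interpret here than the accuracy.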
The summary of our logistic regression model is presented below:
Call:
glm(formula = HOST_INSTITUTION_COUNTRY_CDE ~ STUDENT_NATIONALITY_CDE +
STUDENT_AGE_VALUE + STUDENT_SUBJECT_AREA_VALUE + STUDENT_GENDER_CDE,
family = binomial(), data = filtered)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8644 -0.8695 -0.5691 1.0946 3.1778
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.065424 0.559370 -0.117 0.906892
STUDENT_NATIONALITY_CDEBE -0.510193 0.089534 -5.698 1.21e-08 ***
STUDENT_NATIONALITY_CDEBG -0.201445 0.149165 -1.350 0.176862
STUDENT_NATIONALITY_CDECH 0.195356 0.115699 1.688 0.091320 .
STUDENT_NATIONALITY_CDECY -0.082228 0.242261 -0.339 0.734295
STUDENT_NATIONALITY_CDECZ 0.298433 0.097206 3.070 0.002140 **
STUDENT_NATIONALITY_CDEDE -0.015585 0.073135 -0.213 0.831253
STUDENT_NATIONALITY_CDEDK 1.006396 0.107310 9.378 < 2e-16 ***
STUDENT_NATIONALITY_CDEEE 0.249332 0.197660 1.261 0.207159
STUDENT_NATIONALITY_CDEES 4.032322 0.126069 31.985 < 2e-16 ***
STUDENT_NATIONALITY_CDEFI 0.720312 0.096938 7.431 1.08e-13 ***
STUDENT_NATIONALITY_CDEFR 0.380214 0.073581 5.167 2.38e-07 ***
STUDENT_NATIONALITY_CDEGR -0.546220 0.120104 -4.548 5.42e-06 ***
STUDENT_NATIONALITY_CDEHR -0.792309 0.251065 -3.156 0.001601 **
STUDENT_NATIONALITY_CDEHU -0.139422 0.127126 -1.097 0.272765
STUDENT_NATIONALITY_CDEIE -0.780883 0.133951 -5.830 5.55e-09 ***
STUDENT_NATIONALITY_CDEIS 0.167390 0.253194 0.661 0.508539
STUDENT_NATIONALITY_CDEIT -0.972142 0.075588 -12.861 < 2e-16 ***
STUDENT_NATIONALITY_CDELI -10.305356 84.438362 -0.122 0.902863
STUDENT_NATIONALITY_CDELT -0.453280 0.153101 -2.961 0.003070 **
STUDENT_NATIONALITY_CDELU -0.280322 0.391914 -0.715 0.474446
STUDENT_NATIONALITY_CDELV -1.072289 0.245016 -4.376 1.21e-05 ***
STUDENT_NATIONALITY_CDEMT 2.796590 0.416106 6.721 1.81e-11 ***
STUDENT_NATIONALITY_CDENL 0.574346 0.086693 6.625 3.47e-11 ***
STUDENT_NATIONALITY_CDENO 1.036971 0.125072 8.291 < 2e-16 ***
STUDENT_NATIONALITY_CDEPL -0.978837 0.087812 -11.147 < 2e-16 ***
STUDENT_NATIONALITY_CDEPT -1.108624 0.106818 -10.379 < 2e-16 ***
STUDENT_NATIONALITY_CDERO -0.956060 0.133953 -7.137 9.52e-13 ***
STUDENT_NATIONALITY_CDESE 1.009450 0.097950 10.306 < 2e-16 ***
STUDENT_NATIONALITY_CDESI -0.754328 0.171553 -4.397 1.10e-05 ***
STUDENT_NATIONALITY_CDESK -0.491452 0.135884 -3.617 0.000298 ***
STUDENT_NATIONALITY_CDETR -0.847052 0.104855 -8.078 6.57e-16 ***
STUDENT_NATIONALITY_CDEUK -4.049174 0.226146 -17.905 < 2e-16 ***
STUDENT_AGE_VALUE -0.004137 0.005066 -0.817 0.414192
STUDENT_SUBJECT_AREA_VALUE1 -0.915442 0.611400 -1.497 0.134318
STUDENT_SUBJECT_AREA_VALUE10 -0.049669 0.705469 -0.070 0.943871
STUDENT_SUBJECT_AREA_VALUE14 -0.312324 0.546546 -0.571 0.567695
STUDENT_SUBJECT_AREA_VALUE2 0.337524 0.714199 0.473 0.636505
STUDENT_SUBJECT_AREA_VALUE21 0.076313 0.545027 0.140 0.888647
STUDENT_SUBJECT_AREA_VALUE22 -0.175862 0.543294 -0.324 0.746168
STUDENT_SUBJECT_AREA_VALUE3 0.169626 0.553877 0.306 0.759412
STUDENT_SUBJECT_AREA_VALUE31 -0.609875 0.543833 -1.121 0.262101
STUDENT_SUBJECT_AREA_VALUE32 -0.872924 0.547677 -1.594 0.110966
STUDENT_SUBJECT_AREA_VALUE34 -0.735747 0.543484 -1.354 0.175813
STUDENT_SUBJECT_AREA_VALUE38 -0.155832 0.544337 -0.286 0.774665
STUDENT_SUBJECT_AREA_VALUE4 0.179354 0.756269 0.237 0.812535
STUDENT_SUBJECT_AREA_VALUE42 -0.036296 0.548599 -0.066 0.947250
STUDENT_SUBJECT_AREA_VALUE44 0.010724 0.546164 0.020 0.984334
STUDENT_SUBJECT_AREA_VALUE46 0.179673 0.551668 0.326 0.744658
STUDENT_SUBJECT_AREA_VALUE48 -0.159184 0.549616 -0.290 0.772102
STUDENT_SUBJECT_AREA_VALUE5 -1.524772 0.651268 -2.341 0.019220 *
STUDENT_SUBJECT_AREA_VALUE52 -0.423912 0.544501 -0.779 0.436254
STUDENT_SUBJECT_AREA_VALUE54 -0.465939 0.559786 -0.832 0.405210
STUDENT_SUBJECT_AREA_VALUE58 -0.970780 0.545905 -1.778 0.075356 .
STUDENT_SUBJECT_AREA_VALUE6 -1.042265 0.592845 -1.758 0.078735 .
STUDENT_SUBJECT_AREA_VALUE62 -0.988229 0.557883 -1.771 0.076496 .
STUDENT_SUBJECT_AREA_VALUE64 -2.723809 0.685860 -3.971 7.15e-05 ***
STUDENT_SUBJECT_AREA_VALUE72 -1.195239 0.546208 -2.188 0.028651 *
STUDENT_SUBJECT_AREA_VALUE76 -0.523986 0.563750 -0.929 0.352648
STUDENT_SUBJECT_AREA_VALUE8 0.368579 1.075075 0.343 0.731718
STUDENT_SUBJECT_AREA_VALUE81 -1.162951 0.547930 -2.122 0.033801 *
STUDENT_SUBJECT_AREA_VALUE84 -1.123646 0.681346 -1.649 0.099116 .
STUDENT_SUBJECT_AREA_VALUE85 -1.171075 0.645957 -1.813 0.069842 .
STUDENT_SUBJECT_AREA_VALUE86 1.129102 1.224562 0.922 0.356505
STUDENT_SUBJECT_AREA_VALUE90 -10.391941 84.478372 -0.123 0.902097
STUDENT_SUBJECT_AREA_VALUE99 0.229852 0.581121 0.396 0.692450
STUDENT_GENDER_CDEM 0.150036 0.023913 6.274 3.51e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 63615 on 48489 degrees of freedom
Residual deviance: 51770 on 48423 degrees of freedom
AIC: 51904
In the rightmost column we see the p-values as well as an indicator of the significance of each independent variable. Age is not a significant predictor of the outcome, since its p-value is large (about 0.41). Gender, however, has a very small p-value and is therefore a significant predictor. For study area, most categories are not significant, with a few exceptions such as area 64, so this variable plays only a minor role in estimating the host country. Finally, it is interesting to see that most of the nationality categories are significant, so we can say that nationality is associated with the dependent variable.
R squared is the usual measure of the share of variance explained by a model. A logistic regression model is fitted by maximum likelihood and therefore does not minimize squared error. For that reason R squared is not reported in the summary. However, we can use the following formula to get a sense of the explained variance:
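To make the size of these effects easier to read, a logistic regression coefficient can be turned into an odds ratio by exponentiating it. The sketch below uses two estimates copied from the summary above and assumes the default factor ordering in R, under which ES would be the reference level and the model would describe the odds of UK as host country.

```r
# Estimates copied from the model summary above.
coefs <- c(STUDENT_GENDER_CDEM = 0.150036, STUDENT_AGE_VALUE = -0.004137)

# exp() of a log-odds coefficient gives the multiplicative change in the odds.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio of roughly 1.16 for male students is statistically significant but modest in size, while the odds ratio for age is very close to 1, in line with its large p-value.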
1 - (model$deviance / model$null.deviance)
By dividing the residual deviance by the null deviance and subtracting the result from 1, we obtain a deviance-based pseudo R squared, which in our case is around 18.6%. We can conclude that the model explains the variation in the outcome poorly.
Reviewing the results of our prediction approach, we have to state that the M5P algorithm is not suited to this classification problem. With 27.04% accuracy, less than a third of all instances were classified correctly. The mean absolute error points in a similar direction.
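As a worked check, the same quantity can be recomputed directly from the two deviances reported in the model summary. The variable names below are chosen for this sketch.

```r
# Values taken from the glm summary above.
null_deviance     <- 63615
residual_deviance <- 51770

# Deviance-based pseudo R squared: share of the null deviance removed by the model.
pseudo_r2 <- 1 - residual_deviance / null_deviance
print(round(pseudo_r2, 3))
```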
=== Summary ===
Correlation coefficient 0.1647
Mean absolute error 8.2327
Root mean squared error 9.5352
Relative absolute error 97.0758 %
Root relative squared error 98.6351 %
Total Number of Instances 76756
This implies that, on average, the difference between our model’s predictions and the true HOST_INSTITUTION_COUNTRY_CDE value was about 8.23. With this result the given approach does not seem appropriate. It may also be that trying to predict the destination of a student from the given dataset is not a reasonable task.
This project gave us insight into the composition of the ERASMUS+ process. While working with the data we came to know how this exchange program is seen from an administrative point of view. At first we had a look at the different variables and encountered information about funds, personal information and locations. It was interesting to see how the students participating in this journey are distributed around the continent. One of the most surprising things to see was the age of some students.
After getting a feeling for the dataset we moved on to the statistical assessment of the data. The greatest challenge was to find the right methods, because the structure of the data gave no hint as to how the data is distributed with respect to our stated question. While analysing the relationships within the dataset we found that only a subset of the features is significantly related to our response variable.
Applying the M5’ algorithm to our model brought us to the conclusion that, while not being the worst option, this algorithm is not suited to predicting the host institution. This might be partly because of the algorithm, but it can also be because the dataset is not suited to such predictions.
As we noted in the conclusion, this dataset from the European Union has a strongly administrative point of view. What we tried to predict and assess in this project is, in our opinion, not well captured by these variables. Future work could use this dataset to gain insight into the decisions of the students but would need additional information about the people behind the observations. Choosing a destination for ERASMUS is a very personal decision, strongly influenced by personal experiences and preferences. Because of that it would be interesting to see whether such information could improve the performance of a classification.
For future research we would suggest combining datasets from multiple years to see how preferences change and how the whole program evolves. We would also propose applying different machine learning algorithms in order to obtain better results in predicting the dependent variable. Finally, merging multiple datasets would allow researchers to identify which universities are doing a good job of increasing their popularity over the years, so that others can learn from their progress.